Goto

Collaborating Authors

 theorem 8


Contents Appendix

Neural Information Processing Systems

When the expected rewards of all arms are the same, we know that the arm with the lowest index will be chosen and thus the first K pulls will be π1 = 1,...,πK = K. We will complete the proof through induction. Suppose that the greedy pull sequence is periodic with π1 = 1,...,πK = K and πt+K = πt until time h>K. We will show that πh+1 = 1 if πh = K and πh+1 = πh + 1 otherwise. When k0 = 0 (i.e., πh = K), all arms have been pulled exactly ntimes as of time h. Therefore, by (3), at time h+ 1, arm 1 has the highest expected reward and will be chosen.



47a658229eb2368a99f1d032c8848542-Supplemental.pdf

Neural Information Processing Systems

Based on the feedback from the reviewers, we perform the following additional experiments which 0 explore the robustness of the choice of buffer size in SGD RER, choice of step sizes for GLMtron 10 and the behavior of the said algorithms with heavy tailed noise with a similar setup as in Section 7. We first perform an experimental study about the robustness of SGD RER to the choice of buffer size in Figure 3a. Notice that the performance remains the same for a large range of buffer sizes ( 100 from to 2000). However the performance degrades when the buffer size is too large ( 10000). We believe this is the case since the number of buffers decreases as the buffer size increases and the output is averaged over too few number of iterates (In the case of B = 10000, the final output is just an average of 10 iterates). Theoretically, this largest step-size is L where Lis the largest eigenvalue of -1 the Hessian. In the case of GLMtron, it was experimentally observed that if the step size was chosen 10 to be about 1.5 times the step size reported in Section 7, the iterates diverged. Quasi Newton method essentially normalizes the gradient with the inverse of the Hessian (or rather an approximation of the Hessian) in order to let it converge faster with large step sizes. In Figure 4, we consider the same system as in Section 7 but with heavy tailed noise given by the student t distribution (scale ν = 4.1) so that the 4-th moment exists but higher moments do not. The typical behavior of Forward SGD, SGD-ER, SGD-RER and Quasi Newton methods seems to be similar to that observed in the Sub-Gaussian noise case. However, GLMtron requires much smaller step sizes to ensure convergence and hence it takes much longer.


2d290e496d16c9dcaa9b4ded5cac10cc-Supplemental.pdf

Neural Information Processing Systems

This appendix contains a proofs of the results in the main text and further analysis on the two FIM estimators ˆI1(θ)and ˆI2(θ). In particular, Appendix C presents an analysis of how the FIM estimators and their covariance tensors change under reparametrization. Appendix D presents element-wise bound alternatives to those presented in Section 3.2. Appendix E explores various results using alternative norms to the Frobenius norm results of the main text. Appendix F presents an analysis on taking a linear combination of the two FIM estimators.


Inversion-Free Natural Gradient Descent on Riemannian Manifolds

arXiv.org Machine Learning

The natural gradient method is widely used in statistical optimization, but its standard formulation assumes a Euclidean parameter space. This paper proposes an inversion-free stochastic natural gradient method for probability distributions whose parameters lie on a Riemannian manifold. The manifold setting offers several advantages: one can implicitly enforce parameter constraints such as positive definiteness and orthogonality, ensure parameters are identifiable, or guarantee regularity properties of the objective like geodesic convexity. Building on an intrinsic formulation of the Fisher information matrix (FIM) on a manifold, our method maintains an online approximation of the inverse FIM, which is efficiently updated at quadratic cost using score vectors sampled at successive iterates. In the Riemannian setting, these score vectors belong to different tangent spaces and must be combined using transport operations. We prove almost-sure convergence rates of $O(\log{s}/s^α)$ for the squared distance to the minimizer when the step size exponent $α>2/3$. We also establish almost-sure rates for the approximate FIM, which now accumulates transport-based errors. A limited-memory variant of the algorithm with sub-quadratic storage complexity is proposed. Finally, we demonstrate the effectiveness of our method relative to its Euclidean counterparts on variational Bayes with Gaussian approximations and normalizing flows.


On the Reliability Limits of LLM-Based Multi-Agent Planning

arXiv.org Machine Learning

This technical note studies the reliability limits of LLM-based multi-agent planning as a delegated decision problem. We model the LLM-based multi-agent architecture as a finite acyclic decision network in which multiple stages process shared model-context information, communicate through language interfaces with limited capacity, and may invoke human review. We show that, without new exogenous signals, any delegated network is decision-theoretically dominated by a centralized Bayes decision maker with access to the same information. In the common-evidence regime, this implies that optimizing over multi-agent directed acyclic graphs under a finite communication budget can be recast as choosing a budget-constrained stochastic experiment on the shared signal. We also characterize the loss induced by communication and information compression. Under proper scoring rules, the gap between the centralized Bayes value and the value after communication admits an expected posterior divergence representation, which reduces to conditional mutual information under logarithmic loss and to expected squared posterior error under the Brier score. These results characterize the fundamental reliability limits of delegated LLM planning. Experiments with LLMs on a controlled problem set further demonstrate these characterizations.





SupplementaryMaterial

Neural Information Processing Systems

Proof of Proposition 2. If s = 0, the result is trivial. Hence, using the alternative formulationµs(X) = lnkeXks, we get that s 7 µs(X) is nondecreasing, andlims + µs(X) = ln(esssup(eX)) = esssupX. By definition of IX, φs0(X) and φs1(X) are integrable. The result follows from standard analysis of non-convex gradient descent. Hence, f(x) is inferior to the sum of both terms.